Intro to R: A hands-on tutorial

Day 0: Intro to statistical programming

Sarah Strochak, Kyle Ueyama, Aaron R. Williams

R Lunch Lab

Statistical Programming

Motivation: why statistical programming?

  1. Clearly answer questions
  2. Clearly communicate the answer to questions
  3. Document the steps to answering questions

Example 1

What is 2 + 2?

## [1] 4
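
The slide does not show the code behind this output. A minimal sketch of what could be typed at the R console:

```r
# Arithmetic typed at the console is evaluated and printed immediately
answer <- 2 + 2
answer
```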

Example 2

What is the median price of diamonds with carat > 1 and a Good cut?

## # A tibble: 1 x 1
##   `median(price)`
##             <int>
## 1            6412
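
The code behind this result is not shown. One way it could be computed, assuming the question refers to the `diamonds` data set that ships with ggplot2 and that dplyr is installed (a reconstruction, not necessarily the authors' code):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep diamonds heavier than 1 carat with a "Good" cut,
# then summarize the remaining rows with the median price
median_price <- diamonds %>%
  filter(carat > 1, cut == "Good") %>%
  summarize(median(price))

median_price
```

The script records every step from raw data to answer, which is the third motivation listed above.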

Example 3

How could increasing the retirement age affect the poverty rates of Hispanic women ages 62 and older?

Via die-seite-des-dr-caligari

Motivation: Do cool stuff.

scale

maps

documents

Principles

1) Accuracy

Deliberate steps should be taken to minimize the chance of making an error and maximize the chance of catching errors when errors inevitably occur.

2) Computational reproducibility

Computational reproducibility should be embraced to improve accuracy, promote transparency, and prove the quality of analytical work.

3) Human interpretability

Code should be written so humans can easily understand what’s happening—even if it occasionally sacrifices machine performance.

4) Portability

Analyses should be designed so strangers can understand each and every step without additional instruction or inquiry from the original analyst.

5) Accessibility

Research and data are non-rival and non-excludable. They are public goods that should be widely and easily shared. Decisions about tools, methods, data, and language during the research process should be made in ways that promote the ability of anyone and everyone to access an analysis.

6) Efficiency

Analysts should seek to make all parts of the research process more efficient by communicating clearly, adopting best practices, and managing computation.

Principles

  1. Accuracy
  2. Computational reproducibility
  3. Human interpretability
  4. Portability
  5. Accessibility
  6. Efficiency

Fundamental concepts

Text editor/IDE

  • R == free, open source programming language
  • RStudio == for-profit company and Integrated Development Environment (IDE)

RStudio

The R console

Computational Reproducibility

  • Replication: the recreation of findings across repeated studies; a cornerstone of science
  • Reproducibility: the ability to access data, source code, tools, and documentation and recreate all calculations, visualizations, and artifacts of an analysis
  • Computational reproducibility should be the minimum standard for computational social sciences and statistical programming

Script

  • A plain text document that contains code and comments
  • Map to the answer
  • .R and .Rmd

Comments

  • Clear code avoids the need for describing “what”
  • Comments should focus on “why”
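
A small illustration of the difference, using made-up ages (the vector and variable names are hypothetical):

```r
ages <- c(25, 41, 67, 70)

# A "what" comment only restates the code:
#   keep ages greater than or equal to 62
# A "why" comment explains the decision:
#   62 is the earliest age at which Social Security retirement
#   benefits can be claimed
retirement_age <- ages[ages >= 62]
retirement_age
```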

Coding style

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” ~ Hadley Wickham

  • CamelCase
  • camelCase
  • snake_case

tidyverse style guide
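
A short sketch of the same calculation written two ways. The numbers and object names are made up for illustration; the second version follows the snake_case and spacing conventions of the tidyverse style guide.

```r
# Harder to read: mixed case, no spaces around operators or after commas
MedianPrice<-median(c(326,334,423))

# Easier to read: snake_case name, spaces around <- and after commas
median_price <- median(c(326, 334, 423))

# Both lines compute the same value; only readability differs
identical(MedianPrice, median_price)
```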

R Packages

Collections of R, C, C++, and FORTRAN code that expand the functionality of R.

CRAN

The Comprehensive R Archive Network was introduced in 1997.

Repository of popular R packages with basic standards and quality control.

tidyverse

Comprehensive set of tools for data science

Core: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats

tidyverse

R for Data Science: a free text by Hadley Wickham and Garrett Grolemund

Installing and loading packages
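
A minimal sketch of the usual pattern, using the tidyverse as the example package. Installation is commented out because it only needs to happen once per machine and requires internet access.

```r
# Install once per machine (downloads the package from CRAN)
# install.packages("tidyverse")

# Load at the top of every script or session that uses the package
library(tidyverse)
```

`install.packages()` takes the package name in quotes; `library()` is conventionally written without them.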

Data structures

Scalars (do not exist in R)

Vectors

## [1] 1 2 3 4 5
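
A sketch of code that could produce a vector like the one above (the slide's own code is not shown):

```r
# c() combines values into a vector
x <- c(1, 2, 3, 4, 5)
x

# A single number is a vector of length one,
# which is why R has no separate scalar type
length(4)
```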

Matrices

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
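
A sketch of code that could produce a matrix like the one above (the slide's own code is not shown):

```r
# matrix() fills column by column by default,
# so 1 and 2 land in the first column
m <- matrix(1:6, nrow = 2)
m

# Rows, then columns
dim(m)
```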

Data frames, multidimensional arrays

## # A tibble: 4 x 4
##   name                       awake  brainwt bodywt
##   <chr>                      <dbl>    <dbl>  <dbl>
## 1 Cheetah                     11.9 NA       50    
## 2 Owl monkey                   7    0.0155   0.48 
## 3 Mountain beaver              9.6 NA        1.35 
## 4 Greater short-tailed shrew   9.1  0.00029  0.019
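
A sketch that builds a similar tibble by hand from the values printed above, assuming the tibble package is installed. The object name `mammals` is made up for illustration, and the slide may have produced its table some other way.

```r
library(tibble)

# Each argument becomes a column; all columns must have the same length
mammals <- tibble(
  name    = c("Cheetah", "Owl monkey", "Mountain beaver",
              "Greater short-tailed shrew"),
  awake   = c(11.9, 7, 9.6, 9.1),
  brainwt = c(NA, 0.0155, NA, 0.00029),
  bodywt  = c(50, 0.48, 1.35, 0.019)
)

mammals

# Each column is a vector and can be pulled out with $
mammals$awake
```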

Data types

Character

## [1] "a" "b" "c" "d" "e"

Numeric

## [1] 1 2 3 4 5

Logical

## [1]  TRUE  TRUE FALSE  TRUE FALSE

Factor

## [1] good ok   bad  ok   ok  
## Levels: good ok bad
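
A sketch of how vectors of each type shown above could be created (the object names are made up for illustration):

```r
letters_vec <- c("a", "b", "c", "d", "e")
numbers_vec <- c(1, 2, 3, 4, 5)
logical_vec <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

# A factor stores categories; levels sets their order explicitly
ratings <- factor(
  c("good", "ok", "bad", "ok", "ok"),
  levels = c("good", "ok", "bad")
)

# class() reports the type of each object
class(letters_vec)
class(numbers_vec)
class(logical_vec)
class(ratings)
```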

Missing values

A great strength of R!

NA is R’s encoding for missing values.

Missing values are contagious.

## [1] NA
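
A short sketch of what "contagious" means in practice, using a made-up vector:

```r
x <- c(1, NA, 3)

# Any calculation that touches an NA returns NA
total <- sum(x)
total

# is.na() finds the missing values
missing <- is.na(x)
missing

# Many functions accept na.rm = TRUE to drop missing values first
total_without_na <- sum(x, na.rm = TRUE)
total_without_na
```

Because the NA propagates instead of being dropped silently, the analyst has to decide explicitly how to handle missing data.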

Assignment

R can hold many different objects at the same time. This requires assignment.

<-

## [1] 4
## [1] 4
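
The slide's code is not shown. A sketch of assignment that would print 4 twice:

```r
# <- stores the result of the right-hand side under the name on the left
x <- 2 + 2

# Typing the name prints the stored value
x

# A stored object can be reused in later assignments
y <- x
y
```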

Functions

Arguments by position

## [1] 2.5

Arguments by name

## [1] 2.5

Function documentation ?mean
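
A sketch of both calling styles with `mean()`. The input vector is an assumption chosen to give the 2.5 shown above.

```r
# By position: the first argument of mean() is x
by_position <- mean(c(1, 2, 3, 4))
by_position

# By name: the same call with the argument spelled out
by_name <- mean(x = c(1, 2, 3, 4))
by_name

# Naming arguments is most helpful for optional ones such as na.rm
with_missing <- mean(c(1, 2, 3, 4, NA), na.rm = TRUE)
with_missing

# Open the help page in an interactive session:
# ?mean
```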

Custom functions

Rule of three: never copy and paste the same code three or more times; write a function instead

##  [1] "odd!"  "even!" "odd!"  "even!" "odd!"  "even!" "odd!"  "even!"
##  [9] "odd!"  "even!"
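
A sketch of a custom function that could produce the output above when applied to the numbers 1 through 10. The function name `odd_or_even` is made up; the slide's own code is not shown.

```r
# Return "even!" for even numbers and "odd!" for odd numbers
odd_or_even <- function(x) {
  # %% is the remainder operator; ifelse() works on whole vectors
  ifelse(x %% 2 == 0, "even!", "odd!")
}

labels <- odd_or_even(1:10)
labels
```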

Tests

What will it take to convince you that your code is correct?

  1. Assign monthly observations to fiscal years
     • Are there 12 months per fiscal year?
  2. Link observations from 2017 to observations from 2018
     • Do variables that shouldn’t change between years change?
  3. Tax calculator
     • Are values that must be positive non-positive?

  • Write the test first!

  • Each time you encounter a bug, write a test that will convince you the bug no longer exists.
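
A minimal sketch of the first test using base R's `stopifnot()`. The helper `fiscal_year()` and the October start of the fiscal year are assumptions made for illustration, not part of the original slides.

```r
# Hypothetical helper: assumes the fiscal year starts in October,
# so October through December belong to the next fiscal year
fiscal_year <- function(year, month) {
  ifelse(month >= 10, year + 1, year)
}

# 24 made-up monthly observations, October 2017 through September 2019
obs <- data.frame(
  year  = c(rep(2017, 3), rep(2018, 12), rep(2019, 9)),
  month = c(10:12, 1:12, 1:9)
)
obs$fy <- fiscal_year(obs$year, obs$month)

# Test: every fiscal year should contain exactly 12 months
months_per_fy <- table(obs$fy)
stopifnot(all(months_per_fy == 12))
```

`stopifnot()` is silent when the condition holds and stops the script with an error when it does not, so a failed test cannot go unnoticed.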

Organizing an analysis

Ways to learn a programming language

1: use it, use it again, use it some more.

Software check

A survey of other programming languages

Stata

  • Common users: economists, Nate Silver
  • Strengths: out-of-the-box econometric tools, simple syntax
  • Limitations: proprietary, one data set at a time, inflexible

Photo by StataCorp LP, CC BY-SA 4.0, Unaltered

SAS

  • Common users: veteran researchers, government
  • Strengths: processes data on disk, so data sets do not need to fit in memory
  • Limitations: proprietary, expensive, clunky, inflexible, lacks environments, documentation

Python

  • Common users: data scientists, computer scientists
  • Strengths: general purpose programming, extensibility, flexibility
  • Limitations: steep learning curve

R

  • Common users: statisticians, data scientists, biostatisticians
  • Strengths: extensible, documentation, community
  • Limitations: multiple languages in one

Others

  • Julia
  • Rust
  • JavaScript
  • SQL

What you use matters less than how you use it

R is the best

Comparison

Source is unknown